import pandas as pd
This notebook intends to explore and describe the dataset curated of fusion peptides used in this study. The column 'seq_vfp' has the fusion peptide and the column 'Sequence_fusogenic' contains the larger fusion protein.
path = 'data/viral_fp_curated.csv'
vfp_curated = pd.read_csv(path)
vfp_curated
| idProtein | Name | Class | Activation | Name_Fusogenic_Unit | Location_Fusogenic | Sequence_fusogenic | UniProtID | NcbiID | idTaxonomy | ... | idTaxonomy.1 | CommonName | Family | Genre | Species | SubSpecies | NcbiTax | FP corrected DL | Notes DL | seq_vfp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Envelope glycoprotein | I | binding to receptor | Envelope protein p15E | 471-650 | EPVSLTLALLLGGLTMGGIAAGVGTGTTALVATQQFQQLQAAMHDD... | P03386 | NaN | 2 | ... | 2 | AKV murine leukemia virus (AKR (endogenous) mu... | Retroviridae | Gammaretrovirus | AKR murine leukemia virus | NaN | 11791 | NaN | NaN | VSLTLALLLGGLTMGGIAAGV |
| 1 | 4 | Envelope glycoprotein gp95 | I | binding to receptor | gp37 | 402-606 | STSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... | P03397 | NaN | 4 | ... | 4 | Avian leukosis virus RSA (RSV-SRA) (Rous sarco... | Retroviridae | Alpharetrovirus | Rous sarcoma virus | strain Schmidt-Ruppin A | 269446 | NaN | NaN | GPTARIFASILAPGVAAAQAL |
| 2 | 5 | Fusion glycoprotein F0 | I | binding to receptor | F1 | 103-537 | FVLGAIALGVATAAAVTAGVAIAKTIRLEGEVAAIKGALRKTNEAV... | Q2Y2M3 | NaN | 5 | ... | 5 | Avian metapneumovirus (isolate Canada goose/Mi... | Pneumoviridae | Metapneumovirus | Avian metapneumovirus | 15a | 652954 | NaN | NaN | FVLGAIALGVATAAAVTAGVAI |
| 3 | 6 | Envelope glycoprotein | I | binding to receptor | gp22 | 393-582 | AVQFIPLLVGLGITGATLAGGTGLGVSVHTYHKLSNQLIEDVQALS... | P03399 | NaN | 6 | ... | 6 | Avian reticuloendotheliosis virus | Retroviridae | Gammaretrovirus | Reticuloendotheliosis virus | NaN | 11636 | NaN | NaN | FIPLLVGLGITGATLAGGTGL |
| 4 | 7 | Envelope glycoprotein | I | binding to receptor | gp37 | 79-257 | SVSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... | P33498 | NaN | 7 | ... | 7 | Avian retrovirus RPL30 | Retroviridae | Alpharetrovirus | Avian leukosis virus | RPL30 | 31671 | NaN | NaN | GPTARIFASILAPGVAAAQAL |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 398 | 798 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 286 - 778 | AHCIGVADRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... | Q9YRV3 | NaN | 798 | ... | 798 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | Trinidad/79A/1979 | 407137 | NaN | NaN | DRGWGNGCGLFGKG |
| 399 | 799 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 286 - 778 | AHCIGITDRDFIEGVHGGTWVSASLEQDKCVTVMAPDKPSLDISLQ... | Q1X880 | NaN | 799 | ... | 799 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | isolate Uganda/A7094A4/1948 | 407139 | NaN | NaN | DRGWGNGCGLFGKG |
| 400 | 800 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 286 - 778 | AHCIGITDRDFIEGVHGGTWVSATLEQGKCVTVMAPDKPSLDISLQ... | Q074N0 | NaN | 800 | ... | 800 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | isolate Ethiopia/Couma/1961 | 407141 | NaN | NaN | DRGWGNGCGLFGKG |
| 401 | 801 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 268 - 778 | AHCIGITDRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... | Q89277 | NaN | 801 | ... | 801 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | strain French neurotropic vaccine | 407135 | NaN | NaN | DRGWGNGCGLFGKG |
| 402 | 802 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 291 - 790 | IRCIGVSNRDFVEGMSGGTWVDVVLEHGGCVTVMAQDKPTVDIELV... | Q32ZE1 | NaN | 802 | ... | 802 | Zika virus | Flaviviridae | Flavivirus | Zika virus | NaN | 64320 | NaN | NaN | DRGWGNGCGLFGKGSL |
403 rows × 26 columns
How many fusion peptide identical in the dataset. The fusion peptides identical were not removed as they may have different fusion protein associated.
dup = vfp_curated.duplicated(subset=['seq_vfp'], keep = False)
vfp_curated[dup]
| idProtein | Name | Class | Activation | Name_Fusogenic_Unit | Location_Fusogenic | Sequence_fusogenic | UniProtID | NcbiID | idTaxonomy | ... | idTaxonomy.1 | CommonName | Family | Genre | Species | SubSpecies | NcbiTax | FP corrected DL | Notes DL | seq_vfp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Envelope glycoprotein | I | binding to receptor | Envelope protein p15E | 471-650 | EPVSLTLALLLGGLTMGGIAAGVGTGTTALVATQQFQQLQAAMHDD... | P03386 | NaN | 2 | ... | 2 | AKV murine leukemia virus (AKR (endogenous) mu... | Retroviridae | Gammaretrovirus | AKR murine leukemia virus | NaN | 11791 | NaN | NaN | VSLTLALLLGGLTMGGIAAGV |
| 1 | 4 | Envelope glycoprotein gp95 | I | binding to receptor | gp37 | 402-606 | STSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... | P03397 | NaN | 4 | ... | 4 | Avian leukosis virus RSA (RSV-SRA) (Rous sarco... | Retroviridae | Alpharetrovirus | Rous sarcoma virus | strain Schmidt-Ruppin A | 269446 | NaN | NaN | GPTARIFASILAPGVAAAQAL |
| 2 | 5 | Fusion glycoprotein F0 | I | binding to receptor | F1 | 103-537 | FVLGAIALGVATAAAVTAGVAIAKTIRLEGEVAAIKGALRKTNEAV... | Q2Y2M3 | NaN | 5 | ... | 5 | Avian metapneumovirus (isolate Canada goose/Mi... | Pneumoviridae | Metapneumovirus | Avian metapneumovirus | 15a | 652954 | NaN | NaN | FVLGAIALGVATAAAVTAGVAI |
| 4 | 7 | Envelope glycoprotein | I | binding to receptor | gp37 | 79-257 | SVSHLDDTCSDEVQLWGPTARIFASILAPGVAAAQALREIERLACW... | P33498 | NaN | 7 | ... | 7 | Avian retrovirus RPL30 | Retroviridae | Alpharetrovirus | Avian leukosis virus | RPL30 | 31671 | NaN | NaN | GPTARIFASILAPGVAAAQAL |
| 7 | 11 | Spike glycoprotein | I | binding to receptors | S2 | 750-1235 | SYSASQFQLAVLNYTSPIVVTPINSSGFTAAIPTNFSFSLTQEYIE... | Q0Q4F2 | NaN | 11 | ... | 11 | Bat coronavirus 133/2005 (BtCoV) (BtCoV/133/2005) | Coronaviridae | Betacoronavirus | Tylonycteris bat coronavirus HKU4 | BtCoV/133/2005 | 389230 | NaN | NaN | SSLLGSIAGAGWTAGLSSFAAIPFA |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 397 | 797 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 286 - 778 | AHCIGITDRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLQ... | Q1X881 | NaN | 797 | ... | 797 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | isolate Angola/14FA/1971 | 407140 | NaN | NaN | DRGWGNGCGLFGKG |
| 398 | 798 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 286 - 778 | AHCIGVADRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... | Q9YRV3 | NaN | 798 | ... | 798 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | Trinidad/79A/1979 | 407137 | NaN | NaN | DRGWGNGCGLFGKG |
| 399 | 799 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 286 - 778 | AHCIGITDRDFIEGVHGGTWVSASLEQDKCVTVMAPDKPSLDISLQ... | Q1X880 | NaN | 799 | ... | 799 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | isolate Uganda/A7094A4/1948 | 407139 | NaN | NaN | DRGWGNGCGLFGKG |
| 400 | 800 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 286 - 778 | AHCIGITDRDFIEGVHGGTWVSATLEQGKCVTVMAPDKPSLDISLQ... | Q074N0 | NaN | 800 | ... | 800 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | isolate Ethiopia/Couma/1961 | 407141 | NaN | NaN | DRGWGNGCGLFGKG |
| 401 | 801 | Genome polyprotein | NaN | binding to receptor | Envelop protein E | 268 - 778 | AHCIGITDRDFIEGVHGGTWVSATLEQDKCVTVMAPDKPSLDISLE... | Q89277 | NaN | 801 | ... | 801 | Yellow fever virus | Flaviviridae | Flavivirus | Yellow fever virus | strain French neurotropic vaccine | 407135 | NaN | NaN | DRGWGNGCGLFGKG |
257 rows × 26 columns
counts = vfp_curated['seq_vfp'].value_counts()
counts
VYPFMWGGAYCFCDTENT 12
VSLTLALLLGGLTMGGIAAGV 11
DRGWGNGCGLFGKG 11
AVGIGALFLGFLGAAGSTMGA 9
GVFVLGFLGFLATAGSAMGAA 7
..
VVVSATLCSALYVGDLCGGVMLAAQMFIVSPQ 1
MVGAAALCSAMYLGDLCGGVFLVGQLFTFRPR 1
LAATSASLFPPWTAAAGVPFY 1
KASSRSAIEDLLFDKV 1
DRGWGNGCGLFGKGSL 1
Name: seq_vfp, Length: 216, dtype: int64
counts[counts>1]
VYPFMWGGAYCFCDTENT 12
VSLTLALLLGGLTMGGIAAGV 11
DRGWGNGCGLFGKG 11
AVGIGALFLGFLGAAGSTMGA 9
GVFVLGFLGFLATAGSAMGAA 7
..
FIGAIIGGVALGVATAAQITAAAAL 2
VYPLLWGAAHCFCSTENT 2
WGCNPSDCPGVGTGCTACGLYLD 2
GVGLVIMLVIMAIVAAAGASL 2
LIGAIIGGVALGVATAAQITAASAL 2
Name: seq_vfp, Length: 70, dtype: int64
We can see that there are 216 different fusion peptides and 70 of them have more than one occurrence.
l_or = len(vfp_curated)
l_na_protein = len(vfp_curated.dropna(subset=['Sequence_fusogenic']))
print('The dataset origina has {} entries'.format(l_or))
print('This dataset contains {} entries of fusion peptides with fusion protein'.format(l_na_protein))
print('This dataset contains {} entries of fusion peptides without fusion protein'.format(l_or - l_na_protein))
vfp_uni = vfp_curated.drop_duplicates(subset=['seq_vfp'])
l_or_uni = len(vfp_uni)
l_na_protein_uni = len(vfp_uni.dropna(subset=['Sequence_fusogenic']))
print('######')
print('The dataset with just one entry per fusion peptide contains {} entries'.format(l_or_uni))
print('This dataset contains {} entries of fusion peptides with fusion protein'.format(l_na_protein_uni))
print('This dataset contains {} entries of fusion peptides without fusion protein'.format(l_or_uni - l_na_protein_uni))
The dataset origina has 403 entries This dataset contains 379 entries of fusion peptides with fusion protein This dataset contains 24 entries of fusion peptides without fusion protein ###### The dataset with just one entry per fusion peptide contains 216 entries This dataset contains 192 entries of fusion peptides with fusion protein This dataset contains 24 entries of fusion peptides without fusion protein
There are 379 entries of fusion peptides with fusion peptides. 216 if considering only fusion peptides unique. There are 24 entries without fusion protein.
import matplotlib.pyplot as plt
fig, axs = plt.subplots(2,2, figsize=(20, 15))
plt.subplot(221)
fp_len = vfp_curated['seq_vfp'].str.len()
plt.hist(fp_len)
plt.title('len of fusion peptides w/ dup')
plt.subplot(222)
fp_len = vfp_curated['Sequence_fusogenic'].str.len()
plt.hist(fp_len)
plt.title('len of fusion proteins w/ dup')
plt.subplot(223)
fp_len = vfp_uni['seq_vfp'].str.len()
plt.hist(fp_len)
plt.title('len of fusion peptides w/o dup')
plt.subplot(224)
fp_len = vfp_uni['Sequence_fusogenic'].str.len()
plt.hist(fp_len)
plt.title('len of fusion proteins w/o dup')
plt.subplots_adjust( hspace=0.2, wspace=0.2)
plt.show()
Get the sequences above the 35 aminoacids as it is not very common
vfp_curated[vfp_curated['seq_vfp'].str.len()>35]
| idProtein | Name | Class | Activation | Name_Fusogenic_Unit | Location_Fusogenic | Sequence_fusogenic | UniProtID | NcbiID | idTaxonomy | ... | idTaxonomy.1 | CommonName | Family | Genre | Species | SubSpecies | NcbiTax | FP corrected DL | Notes DL | seq_vfp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 17 | Envelope glycoprotein p57 | II | ph | Envelope glycoprotein p29 | 250-503 | DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... | P52638 | NaN | 17 | ... | 17 | Borna disease virus (strain V) (BDV) | Bornaviridae | NaN | Borna disease virus | strain V | 928296 | NaN | NaN | ILQSLLLGVFGTGIASASQFLRSWLNHPDIIGYIVNGVGVVW |
| 14 | 18 | Envelope glycoprotein p57 | II | ph | Envelope glycoprotein p29 | 250-503 | DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... | Q8BB27 | NaN | 18 | ... | 18 | Borna disease virus 1 (BoDV-1) | Bornaviridae | Orthobornavirus | Mammalian 1 orthobornavirus | 1 | 1714621 | NaN | NaN | ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW |
| 253 | 334 | HA | I | NaN | HA2 subunit | NaN | NaN | NaN | NaN | 316 | ... | 316 | Infectious salmon anemia virus (isolate Atlant... | Orthomyxoviridae | Isavirus | Salmon isavirus | isolate 810/9/99 | 652965 | NaN | NaN | GIFGAIAGFIENGWEGMVDGWYGFRHQNSEGTGQAADL |
| 270 | 559 | Pre-glycoprotein polyprotein GP complex | I | low pH | Glycoprotein G2 | 252-485 | AFFSWSLTDSSGKDTPGGYCLEEWMLVAAKMKCFGNTAVAKCNLNH... | P26313 | NaN | 558 | ... | 558 | Jembrana disease virus (JDV) | Retroviridae | Lentivirus | Jembrana disease virus | NaN | 36370 | NaN | NaN | AFFSWSLTDSSGKDTPGGYCLEEWMLVAAKMKCFGNTAV |
| 280 | 570 | Pre-glycoprotein polyprotein GP complex | I | low pH | GP2 | 260-491 | GTFTWTLSDSEGKDTPGGYCLTRWMLIEAELKCFGNTAVAKCNEKH... | P08669 | NaN | 570 | ... | 570 | Lassa virus | Arenaviridae | Mammarenavirus | Lassa mammarenavirus | strain Josiah | 11622 | GTFTWTLSDSEGKDTPGGYCLTRWMLIEAELKCFGNTAV | According to https://doi.org/10.1099/vir.0.829... | GTFTWTLSDSEGKDTPGGYCLTRWMLIEAELKCFGNTAV |
| 281 | 571 | Pre-glycoprotein polyprotein GP complex | I | low pH | GP2 | 259-490 | GTFTWTLSDSEGNETPGGYCLTRWMLIEAELKCFGNTAVAKCNEKH... | P17332 | NaN | 571 | ... | 571 | Lassa virus | Arenaviridae | Mammarenavirus | Lassa mammarenavirus | GA391 | 11621 | GTFTWTLSDSEGNETPGGYCLTRWMLIEAELKCFGNTAV | Same as prevois, but from a different strain | GTFTWTLSDSEGNETPGGYCLTRWMLIEAELKCFGNTAV |
| 283 | 574 | Pre-glycoprotein polyprotein GP complex | I | low pH | GP2 | 266-498 | GTFTWTLSDSSGVENPGGYCLTKWMILAAELKCFGNTAVAKCNVNH... | P07399 | NaN | 574 | ... | 574 | Lymphocytic choriomeningitis virus | Arenaviridae | Mammarenavirus | Lymphocytic choriomeningitis mammarenavirus | (strain WE) | 11627 | GTFTWTLSDSSGVENPGGYCLTKWMILAAELKCFGNTAV | Same as Lassa | GTFTWTLSDSSGVENPGGYCLTKWMILAAELKCFGNTAV |
| 284 | 576 | glycoprotein precursor | I | low pH | GP2 | 496 | MGQLISFFQEIPVFLQEALNIALVAVSLIAVIKGIINLYKSGLFQF... | NaN | AAS77879.1 | 576 | ... | 576 | Machupo virus | Arenaviridae | Mammarenavirus | Machupo mammarenavirus | NaN | 11628 | AFFSWSLTDSSGKDMPGGYCLEEWMLIAAKMKCFGNTAV | Same as Lassa | AFFSWSLTDSSGKDMPGGYCLEEWMLIAAKMKCFGNTAV |
| 330 | 659 | Glycoprotein G | NaN | low pH | NaN | 20-524 | KFPIYTIPDKLGPWSPIDIHHLSCPNNLVVEDEGCTNLSGFSYMEL... | P08667 | NaN | 659 | ... | 659 | Rabies virus | Rhabdoviridae | Lyssavirus | Rabies lyssavirus | strain Pasteur vaccin | 103929 | NaN | NaN | NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS |
9 rows × 26 columns
vfp_curated[vfp_curated['seq_vfp'].str.len()>40]
| idProtein | Name | Class | Activation | Name_Fusogenic_Unit | Location_Fusogenic | Sequence_fusogenic | UniProtID | NcbiID | idTaxonomy | ... | idTaxonomy.1 | CommonName | Family | Genre | Species | SubSpecies | NcbiTax | FP corrected DL | Notes DL | seq_vfp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 17 | Envelope glycoprotein p57 | II | ph | Envelope glycoprotein p29 | 250-503 | DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... | P52638 | NaN | 17 | ... | 17 | Borna disease virus (strain V) (BDV) | Bornaviridae | NaN | Borna disease virus | strain V | 928296 | NaN | NaN | ILQSLLLGVFGTGIASASQFLRSWLNHPDIIGYIVNGVGVVW |
| 14 | 18 | Envelope glycoprotein p57 | II | ph | Envelope glycoprotein p29 | 250-503 | DTQQIEYLVHKLRPTLKDAWEDCEILQSLLLGVFGTGIASASQFLR... | Q8BB27 | NaN | 18 | ... | 18 | Borna disease virus 1 (BoDV-1) | Bornaviridae | Orthobornavirus | Mammalian 1 orthobornavirus | 1 | 1714621 | NaN | NaN | ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW |
| 330 | 659 | Glycoprotein G | NaN | low pH | NaN | 20-524 | KFPIYTIPDKLGPWSPIDIHHLSCPNNLVVEDEGCTNLSGFSYMEL... | P08667 | NaN | 659 | ... | 659 | Rabies virus | Rhabdoviridae | Lyssavirus | Rabies lyssavirus | strain Pasteur vaccin | 103929 | NaN | NaN | NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS |
3 rows × 26 columns
T not take to much in consideration the end and begining of sequences, all the fusion peptides were kept even if they exceed 35 aminoacids.
With or without suplicates the distribution is the same (however in lower frequencies). The fusion peptides are distributed around the 20 and 25 aminoacids. The fusion porteins are distributed around the 300 and 400 aminoacids.
counts = vfp_curated['Class'].value_counts()
print(counts)
counts_uni = vfp_uni['Class'].value_counts()
print(counts_uni)
fig, axs = plt.subplots(1,2, figsize=(20, 15))
plt.subplot(121)
plt.pie(counts, labels = counts.index)
plt.title('distribution of class w/ dup')
plt.subplot(122)
plt.pie(counts_uni, labels=counts.index)
plt.title('distribution of class w/o dup')
plt.legend()
plt.show()
I 261 II 78 III 27 Name: Class, dtype: int64 I 142 II 48 III 8 Name: Class, dtype: int64
counts = vfp_curated['Family'].value_counts()
print(counts)
print('#######\n \n \n')
counts_uni = vfp_uni['Family'].value_counts()
print(counts_uni)
fig, axs = plt.subplots(1,2, figsize=(20, 20))
plt.subplot(121)
plt.pie(counts, labels = counts.index)
plt.title('distribution of class w/ dup')
plt.subplot(122)
plt.pie(counts_uni, labels=counts.index) # , textprops={'fontsize': 10}
plt.title('distribution of class w/o dup')
# plt.legend(prop = {'size' : 10})
plt.subplots_adjust( hspace=0.2, wspace=0.5)
plt.show()
Retroviridae 124 Flaviviridae 69 Paramyxoviridae 36 Retroviridae 30 Togaviridae 27 Orthomyxoviridae 27 Herpesviridae 24 Filoviridae 15 Coronaviridae 14 Paramyxoviridae 13 Pneumoviridae 8 Arenaviridae 4 Peribunyaviridae 2 Bornaviridae 2 Bunyaviridae 2 Rhabdoviridae 2 Nairoviridae 1 Hepadnaviridae 1 Pneumoviridae 1 Phenuiviridae 1 Name: Family, dtype: int64 ####### Retroviridae 70 Flaviviridae 43 Paramyxoviridae 19 Retroviridae 13 Orthomyxoviridae 13 Coronaviridae 11 Paramyxoviridae 8 Togaviridae 8 Filoviridae 7 Pneumoviridae 5 Herpesviridae 5 Arenaviridae 4 Bornaviridae 2 Rhabdoviridae 2 Nairoviridae 1 Bunyaviridae 1 Hepadnaviridae 1 Peribunyaviridae 1 Pneumoviridae 1 Phenuiviridae 1 Name: Family, dtype: int64
different_fp = vfp_curated['seq_vfp'].unique()
print(len(different_fp))
different_fp = list(different_fp)
from Levenshtein import distance
import copy
def match(s1, s2):
return distance(s1, s2) #<= 1
def get_dif(different_fp, dist_accepted):
ls = []
ls_peptides = copy.deepcopy(different_fp)
for peptide_1 in ls_peptides:
count = -1 # to not cunt the peptide itself.
# ls_peptides.remove(peptide_1)
for peptide_2 in ls_peptides:
dist = match(peptide_1,peptide_2)
if dist <dist_accepted+1: # aceita 2 mudancas - 3. se aceita uma mudanca 2
count += 1
# ls_peptides.remove(peptide_2)
else:
pass
ls.append(count)
return ls
ls_dif2 = get_dif(different_fp, dist_accepted = 2) # with 2 positions changed
ls_dif1 = get_dif(different_fp, dist_accepted = 1) # with just one position changed
ls_dif3 = get_dif(different_fp, dist_accepted = 3) # with 3 position changed
print('\n number of fpep with one difference in aminoacid referent to the fpep in that list positon. \n ')
print(sorted(ls_dif1, reverse=True))
print('\n number of fpep with2 difference in aminoacid referent to the fpep in that list positon. \n')
print(sorted(ls_dif2, reverse=True))
print('\n number of fpep with 3 difference in aminoacid referent to the fpep in that list positon. \n ')
print(sorted(ls_dif3, reverse=True))
216 number of fpep with one difference in aminoacid referent to the fpep in that list positon. [5, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] number of fpep with2 difference in aminoacid referent to the fpep in that list positon. [13, 12, 10, 9, 8, 7, 7, 7, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] number of fpep with 3 difference in aminoacid referent to the fpep in that list positon. [21, 19, 18, 18, 17, 15, 15, 15, 15, 14, 13, 13, 11, 9, 9, 9, 9, 9, 8, 8, 8, 8, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
A simple function to have an overview of the number of fusion peptides that may be very similar to one another.
The function grabs a sequence and checks in the list of non duplicates if they have just one difference from the other in the list. It counts all of this sequences (this may be oversimplist) . The printed lists are sorted (not by order of dataframe).
To have an idea, there is a sequence with 5 sequences in the dataset with just one difference in aminoacid. 2 sequences with 4 sequences with one aa changed and 9 sequences with 3 similar seqs. (as the sequences are not deleted, some of these may belong to the same group)
The re are one sequence with 13 sequences with 2 aa difference.
If we go to 3 aas difference, there is one sequence with 21 sequences just differing in 3 aas.
The similarity of sequences is a thing to have into account. The next step is calcualting the bilogical similarity and aggregate this sequences using the CDHIT.
# comapre theses sequences
# cd-hit -i nr -o nr100 -c 1.00 -n 5 -M 16000 –d 0 -T 8
# cd-hit -i db -o db90 -c 0.9 -n 5 -M 16000 –d 0 -T 8
# where
# db is the filename of input,
# db90 is output,
# -c 1.0, means 100% identity, is the clustering threshold
# -c 0.9, means 90% identity, is the clustering threshold
# -n 5 is the word size
# -d 0 use sequence name in fasta header till the first white space
# -M 16000, to use 16GB RAM
# -T 8, to use 8 threads
# Choose of word size:
# -n 5 for thresholds 0.7 ~ 1.0
# -n 4 for thresholds 0.6 ~ 0.7
# -n 3 for thresholds 0.5 ~ 0.6
# -n 2 for thresholds 0.4 ~ 0.5
# protein dataset in fasta format
# output are two files: a fasta file of representative sequences and a text file of list of clusters.
print(len(vfp_curated))
# make fasta
from IO import make_fasta
make_fasta(df = vfp_curated , filename_fasta = 'fasta_curated.fasta' ,seq_column = 'seq_vfp')
# clustering 90 % similarity
!cd-hit -i fasta_curated.fasta -o results_cdhit90 -c 0.9 -n 5 -M 16000 -d 0 -T 8
print('finished')
from IO import read_fasta
# retrieves a fasta file with representative sequences
seq100 = read_fasta('fasta_curated.fasta', th = 100)
seq90 = read_fasta('results_cdhit90', th=90)
403
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i fasta_curated.fasta -o results_cdhit90 -c
0.9 -n 5 -M 16000 -d 0 -T 8
Started: Sat Dec 24 11:21:27 2022
================================================================
Output
----------------------------------------------------------------
Warning: from file "fasta_curated.fasta",
Discarding invalid sequence or sequence without identifier and description!
>nan_G
[FRWYGPKY CGYATVT]
total seq: 402
longest and shortest : 47 and 13
Total letters: 9063
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 0M
Buffer : 8 X 10M = 84M
Table : 2 X 65M = 130M
Miscellaneous : 0M
Total : 214M
Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 1973149542
# comparing sequences from 0 to 402
---------- new table with 120 representatives
402 finished 120 clusters
Approximated maximum memory consumption: 214M
writing new database
writing clustering information
program completed !
Total CPU time 0.15
finished
number of sequences with 100 % threshold: 403
number of sequences with 90 % threshold: 120
# and a cluster file
seq90 = read_fasta('results_cdhit90.clstr', th=90)
# >Cluster 0
# 0 458aa, >P0DTC2_Spike... *
# >Cluster 1
# 0 91aa, >D3IZQ2_Viral... *
# >Cluster 2
number of sequences with 90 % threshold: 120
# clustering 80 % similarity
!cd-hit -i fasta_curated.fasta -o results_cdhit80 -c 0.8 -n 5 -M 16000 -d 0 -T 8
print('finished')
# clustering 50 % similarity
!cd-hit -i fasta_curated.fasta -o results_cdhit50 -c 0.5 -n 3 -M 16000 -d 0 -T 8 # for thresholds lower word size must be increased
print('finished')
from IO import read_fasta
# retrieves a fasta file with representative sequences
seq100 = read_fasta('fasta_curated.fasta', th = 100)
seq80 = read_fasta('results_cdhit80', th=80)
seq50 = read_fasta('results_cdhit50', th=50)
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i fasta_curated.fasta -o results_cdhit80 -c
0.8 -n 5 -M 16000 -d 0 -T 8
Started: Sat Dec 24 11:21:27 2022
================================================================
Output
----------------------------------------------------------------
Warning: from file "fasta_curated.fasta",
Discarding invalid sequence or sequence without identifier and description!
>nan_G
[FRWYGPKY CGYATVT]
total seq: 402
longest and shortest : 47 and 13
Total letters: 9063
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 0M
Buffer : 8 X 10M = 84M
Table : 2 X 65M = 130M
Miscellaneous : 0M
Total : 214M
Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 1973149542
# comparing sequences from 0 to 402
---------- new table with 91 representatives
402 finished 91 clusters
Approximated maximum memory consumption: 214M
writing new database
writing clustering information
program completed !
Total CPU time 0.16
finished
================================================================
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 07 2021, 10:57:21
Command: cd-hit -i fasta_curated.fasta -o results_cdhit50 -c
0.5 -n 3 -M 16000 -d 0 -T 8
Started: Sat Dec 24 11:21:28 2022
================================================================
Output
----------------------------------------------------------------
Warning: from file "fasta_curated.fasta",
Discarding invalid sequence or sequence without identifier and description!
>nan_G
[FRWYGPKY CGYATVT]
total seq: 402
longest and shortest : 47 and 13
Total letters: 9063
Sequences have been sorted
Approximated minimal memory consumption:
Sequence : 0M
Buffer : 8 X 10M = 84M
Table : 2 X 0M = 0M
Miscellaneous : 0M
Total : 84M
Table limit with the given memory limit:
Max number of representatives: 4000000
Max number of word counting entries: 1989448902
# comparing sequences from 0 to 402
---------- new table with 40 representatives
402 finished 40 clusters
Approximated maximum memory consumption: 84M
writing new database
writing clustering information
program completed !
Total CPU time 0.08
finished
number of sequences with 100 % threshold: 403
number of sequences with 80 % threshold: 91
number of sequences with 50 % threshold: 40
From 403 sequences, 216 are repeating sequences. From this, using a cutoff of 90 % similarity (and word size 5) 120 sequences clusters with less than 90 % similarity. we have 81 sequences with less than 80 % similarity and 40 sequences with 50 % similarity (the word size here is 3)
Create protein logos (with motif from Biopython and logomaker) for sequences with different levels of similarity.
from Bio import motifs
import logomaker
def create_log(list_seqs):
maxlen = len(max(list_seqs, key = len))
seq_pad = [str(seq).ljust(maxlen, 'X') for seq in list_seqs if '['not in str(seq)]
m = motifs.create(seq_pad, alphabet= 'ACDEFGHIKLMNPQRSTVWYX')
# create Logo object
ww_logo = logomaker.Logo(pd.DataFrame(m.counts),
color_scheme='NajafabadiEtAl2017',
vpad=.1,
width=.8)
print(m.consensus)
create_log(list_seqs = seq100)
# create_log(list_seqs = seq90)
create_log(list_seqs = seq80)
create_log(list_seqs = seq50)
Warning: Character 'X' is not in color_dict. Using black. GVGAAAGLGALGGAAAATXGAXXXXXXXXXXXXXXXXXXXXXXXXXX Warning: Character 'X' is not in color_dict. Using black. GVGLAIALGALGVGAAAGAGXXXXXXXXXXXXXXXXXXXXXXXXXXX Warning: Character 'X' is not in color_dict. Using black. GLGLLLLLGGALAGAAASLAXXXXXXXXXXXXXXXXXXXXXXXXXXX
Create protein logos (with motif from Biopython and logomaker) for sequences from different families.
for family in vfp_curated['Family'].unique():
vfp_fam = vfp_uni.loc[vfp_uni['Family']==family] # using not the duplicates
list_seqs = vfp_fam['seq_vfp']
print(family)
create_log(list_seqs)
Retroviridae Warning: Character 'X' is not in color_dict. Using black. AVGLLGFLGGLLGAAGGTMGAXXXXXXXXXXXXXXXXXX Pneumoviridae Warning: Character 'X' is not in color_dict. Using black. FVEGAIALGVATAAAVTAGVAIXXXXXXX Retroviridae Warning: Character 'X' is not in color_dict. Using black. VMLALATLLSGAAAGTGATAIXX Coronaviridae Warning: Character 'X' is not in color_dict. Using black. SAATGAIADFGWFAFAXSXXXXXXX Bornaviridae Warning: Character 'X' is not in color_dict. Using black. ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW Paramyxoviridae Warning: Character 'X' is not in color_dict. Using black. FAGVVIGGAALGVATAAQITAGVALXXXXXX Peribunyaviridae Warning: Character 'X' is not in color_dict. Using black. WGCEEFGCLAVSDGCVFGSCQD Togaviridae Warning: Character 'X' is not in color_dict. Using black. VYPFMWGGAYCFCDSENTXXX Nairoviridae Warning: Character 'X' is not in color_dict. Using black. WRCNPTWCWGVGTGCTCCGLDVK Flaviviridae Warning: Character 'X' is not in color_dict. Using black. LVGAATLCSALYVGDLCGSVFLVGQLFTFSPR Filoviridae Warning: Character 'X' is not in color_dict. Using black. GAAAGLAWIPYFGPAAXXXXX Herpesviridae Warning: Character 'X' is not in color_dict. Using black. PDFFRLAFLLAVSGFAFVNAA Bunyaviridae Warning: Character 'X' is not in color_dict. Using black. WGCNPSDCPGVGTGCTACGLYLD Hepadnaviridae Warning: Character 'X' is not in color_dict. Using black. MENITSGFLGPLLVLQAGFFLLTR Orthomyxoviridae Warning: Character 'X' is not in color_dict. Using black. GLFGAIAGFIENGWEGMIDGWYGFRXXXXXXXXXXXXX Arenaviridae Warning: Character 'X' is not in color_dict. Using black. GTFTWTLSDSEGKDTPGGYCLTRWMLIAAELKCFGNTAV Pneumoviridae Warning: Character 'X' is not in color_dict. Using black. FLGLILGLGAAVTAGVALAKT Paramyxoviridae Warning: Character 'X' is not in color_dict. Using black. FIGAIIGGVALGVATAAQITAASAL Rhabdoviridae Warning: Character 'X' is not in color_dict. Using black. NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS Phenuiviridae Warning: Character 'X' is not in color_dict. Using black. WGCGCFNVNPSCLFVHT
Create protein logos (with motif from Biopython and logomaker) for sequences from different clusters.
cluster90 = 'results_cdhit50.clstr'
fasta_or = 'fasta_curated.fasta'
# >Cluster 0
# 0 47aa, >P08667_Glycoprotein... *
# >Cluster 1
# 0 42aa, >P52638_Envelope... *
# 1 42aa, >Q8BB27_Envelope... at 97.62%
# >Cluster 2
# 0 39aa, >P26313_Pre-glycoprotein... *
# 1 39aa, >nan_glycoprotein... at 94.87%
# >P03386_Envelope glycoprotein_I_0
# VSLTLALLLGGLTMGGIAAGV
# >P03397_Envelope glycoprotein gp95_I_1
# GPTARIFASILAPGVAAAQAL
# transform fasta_or in dataframe
from Bio import SeqIO
dict_or = dict()
with open(fasta_or) as fasta_file:
for seq_record in SeqIO.parse(fasta_file, 'fasta'): # (generator)
dict_or[seq_record.id] = seq_record.seq
import re
def read_cluster_file(cl_file, dict_or):
dict_cluster = dict()
key_list = []
cluster_name = 'None'
with open(cl_file) as cluster_file:
for line in cluster_file:
if line.startswith('>'):
dict_cluster[cluster_name] = key_list # from previous
# reset the name and the list
cluster_name = line # Cluster 0
key_list = []
else:
m = re.search('>(.+?)\\.', line)
name = m.group(1)
# name = 'between Z and ...' #>nan_HA_I_247...
key_list.append(dict_or[name])
dict_cluster.pop('None')
return dict_cluster
clusters = read_cluster_file(cl_file=cluster90, dict_or=dict_or)
for list_seqs in clusters.values():
create_log(list_seqs)
Warning: Character 'X' is not in color_dict. Using black. NPYPDYHWLRTVKTTKESLVIISPSVADLDPYDRSLHSRVFPGGNCS Warning: Character 'X' is not in color_dict. Using black. ILQSLLLGVFGTGIASASQFLRGWLNHPDIIGYIVNGVGVVW Warning: Character 'X' is not in color_dict. Using black. GTFTWTLSDSSGKDTPGGYCLTEWMLIAAELKCFGNTAV Warning: Character 'X' is not in color_dict. Using black. GLFGAIAGFIEGGWEGMIDGWYGFRXXXXXXXXXXXXX Warning: Character 'X' is not in color_dict. Using black. LVGAATLCSALYVGDLCGSVFLVGQLFTFSPR Warning: Character 'X' is not in color_dict. Using black. FAGVVIGGAALGVATAAQITAAIALXXXXXX Warning: Character 'X' is not in color_dict. Using black. NEESYYHIAARIATSIFALSEMGRTTEYF Warning: Character 'X' is not in color_dict. Using black. SSLLGSIAGAGWTAGLSSFAAIPFA Warning: Character 'X' is not in color_dict. Using black. MENITSGFLGPLLVLQAGFFLLTR Warning: Character 'X' is not in color_dict. Using black. WGCNPSDCPGVGTGCTACGLYLD Warning: Character 'X' is not in color_dict. Using black. NLRRIQEAGLGLANAITTVAKIS Warning: Character 'X' is not in color_dict. Using black. NYAKLRSMGYALTGAVQTLSQIS Warning: Character 'X' is not in color_dict. Using black. WGCEEFGCLAVSDGCVFGSCQD Warning: Character 'X' is not in color_dict. Using black. AVGVGGFFLGFLGAAGSTMGAX Warning: Character 'X' is not in color_dict. Using black. VSLTLALLLGGLTMGGIAAGV Warning: Character 'X' is not in color_dict. Using black. GPTARIFASILAPGVAAAQAL Warning: Character 'X' is not in color_dict. Using black. FIPLLVGLGITTAVSTGTTGL Warning: Character 'X' is not in color_dict. Using black. AVGLAIFLLVLAIMAITASLT Warning: Character 'X' is not in color_dict. Using black. FLGFLLGVGSAIASGVAVSKV Warning: Character 'X' is not in color_dict. Using black. GIGLVIVLAIMAIIAAAGAGL Warning: Character 'X' is not in color_dict. Using black.
/home/martinha/miniconda3/envs/viral_fp/lib/python3.9/site-packages/logomaker/src/Logo.py:194: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`. fig, ax = plt.subplots(1, 1, figsize=self.figsize)
TETLTTMFEVSVAFFKVGHAV Warning: Character 'X' is not in color_dict. Using black. FGISAIVAAIVAATAIAASAT Warning: Character 'X' is not in color_dict. Using black. VMLALATVLSMAGAGTGATAI Warning: Character 'X' is not in color_dict. Using black. LAATSASLFPPWTAAAGVPFY Warning: Character 'X' is not in color_dict. Using black. IGGIALGGLTSAVSIPFSLAI Warning: Character 'X' is not in color_dict. Using black. PLFWRLTGLLATSGFAFVNAA Warning: Character 'X' is not in color_dict. Using black. KDTFHLMFLFGLTHFLYSTRG Warning: Character 'X' is not in color_dict. Using black. AVPVAVWLVSALAMGAGVAGG Warning: Character 'X' is not in color_dict. Using black. GAAWGLAWIPYFGPAAXXXXX Warning: Character 'X' is not in color_dict. Using black. FVAAIILGISALIAIITSFAV Warning: Character 'X' is not in color_dict. Using black. VYPFMWGGAYCFCDTENTXXX Warning: Character 'X' is not in color_dict. Using black. RRTVEMAFAYALALFAAARQ Warning: Character 'X' is not in color_dict. Using black. PDIFLVLFQMLVAHFLVARG Warning: Character 'X' is not in color_dict. Using black. MYKTPAIKDFGGFNFSQIL Warning: Character 'X' is not in color_dict. Using black. GIPQMMFLYGIVHFSYSTK Warning: Character 'X' is not in color_dict. Using black. GTEVSEALGGAGLTGGFY Warning: Character 'X' is not in color_dict. Using black. RSARSAIEDLLFDKVXXX Warning: Character 'X' is not in color_dict. Using black. WGCGCFNVNPSCLFVHT Warning: Character 'X' is not in color_dict. Using black. DRGWGNGCGLFGKGXX